Code
reticulate::repl_python()exit
Linear Regression vs. Graph Neural Networks
Slides: slides.html ( Go to slides.qmd to edit)
Remember: Your goal is to make your audience understand and care about your findings. By crafting a compelling story, you can effectively communicate the value of your data science project.
Carefully read this template since it has instructions and tips to writing!
Accurate weather prediction is a crucial task with widespread implications across agriculture, transportation, disaster preparedness, and energy management. Traditional forecasting methods often rely on statistical models or physics-based simulations, however, with the advancement of graphical neural networks (GNN) we believe there is potential in a more modern deep learning approach.
In this project, we explore the predictive power of a traditional linear regression model and a GNN on real-world weather station data. Our aim is to evaluate whether the GNN’s ability to incorporate spatial relationships between stations offers a measurable advantage over more conventional techniques
The dataset consists of multiple weather stations located within the same geographic region. Each station collects meteorological variables over time, and can be represented as a node within a broader spatial network. For the linear model baseline, a single model will be trained using all stations’ data simultaneously, treating each station as an independent feature source.
For the GNN the model will be trained on the entire network of stations, where each node corresponds to a station and edges represent spatial relationships. The graph is encoded via a dense adjacency matrix, excluding self-connections. The GNN aims to leverage the inherent spatial structure of the data, potentially capturing regional weather patterns and inter-station dependencies that are invisible to traditional models.
Our evaluation focuses on forecasting performance over a 6-month test period at the end of the dataset. We asses how well each modelling approach predicts key weather variables and investigate the conditions under which one model may outperform the other.
This section outlines the modeling approaches, data structure, and training procedures used to compare the traditional linear model and the GNN on weather station data.
Work in progress
Work in progress
The linear model is formulated as a time-series regression task. It uses the feature information from the previous four time steps to predict the feature values at the next time step. Each input consists of a concatenation of the five meteorological features across four sequential time steps, resulting in a fixed length input vector per prediction target. The five input features are:
The GNN is designed to capture spatiotemporal dependencies in the weather station network. It is implemented using PyTorch and follows a structure inspired by the Diffusion Convolutional Recurrant Neural Network (DCRNN) architecture.
where each node represents a weather station and temporal sequences of node features are used for prediction.
The model is trained to predict the same five features (temperature, relative humidity, wind speed, wind direction sin, wind direction cosine) for the next time step based on the preceding four time steps, analogous to the linear model.
Describe your data sources and collection process.
Present initial findings and insights through visualizations.
Highlight unexpected patterns or anomalies.
A study was conducted to determine how…
reticulate::repl_python()exit
# %pip install polars==1.22.0
# %pip install numpy
# %pip install pandas
# %pip install seaborn
# %pip install matplotlib
# %pip install pyarrow
# %pip install datetime
# %pip install contextily
# %pip install h3==3.7.7
# %pip install shapely
# %pip install geopandas
# %pip install scikit-learn
# %pip install tqdm
# %pip install hypothesis==6.135.26# Standard library imports (e.g., OS, datetime, etc.)
import os
import typing
import datetime
from pathlib import Path
import shutil
# Third-party libraries
import numpy as np
import polars as pl
import seaborn as sns
import pandas as pd
from polars.testing.parametric import columns
from sklearn.preprocessing import minmax_scale, robust_scale
from tqdm import tqdm
import geopy.distance
import scipy
import sklearn.preprocessing as pre
import geopandas
import shapely
from shapely.geometry import Polygon, Point, LineString
import matplotlib.pyplot as plt
import matplotlib.animation as animation
import contextily as ctx
from h3 import h3import datetime
import polars as pl
start_date = datetime.datetime(2010, 1, 1, 0, 0)
end_date = datetime.datetime(2020, 12, 31, 0, 0)
seed = 3435
pl.enable_string_cache()
data_path = r'kansas_asos_2010_2020.csv'metar_schema = {'station': pl.Categorical,
'valid': pl.Datetime,
'lon': pl.Float64,
'lat': pl.Float64,
'elevation': pl.Float64,
'tmpf': pl.Float64,
'dwpf': pl.Float64,
'relh': pl.Float64,
'drct': pl.Float64,
'sknt': pl.Float64,
'p01i': pl.Float64,
'alti': pl.Float64,
'mslp': pl.Float64,
'vsby': pl.Float64,
'gust': pl.Float64,
'skyc1': pl.Categorical,
'skyc2': pl.Categorical,
'skyc3': pl.Categorical,
'skyc4': pl.Categorical,
'skyl1': pl.Float64,
'skyl2': pl.Float64,
'skyl3': pl.Float64,
'skyl4': pl.Float64,
'wxcodes': pl.String,
'ice_accretion_1hr': pl.Float64,
'ice_accretion_3hr': pl.Float64,
'ice_accretion_6hr': pl.Float64,
'peak_wind_gust': pl.Float64,
'peak_wind_drct': pl.Float64,
'peak_wind_time': pl.Datetime,
'feel': pl.Float64,
'metar': pl.String,
'snowdepth': pl.Float64}asos_ldf = pl.scan_csv(data_path, null_values=['T', 'M', '///'], schema=metar_schema)\
.drop('metar')\
.with_columns(pl.col('valid').dt.round('1h').alias('valid'))import numpy as np
full_date_series = np.arange(start_date, end_date, datetime.timedelta(hours=1))
asos_df = asos_ldf\
.collect()\
.select(pl.col('station', 'lat', 'lon', 'elevation'))\
.unique()\
.join(pl.DataFrame({'valid': full_date_series}), how='cross')\
.join(asos_ldf.collect(), on=['station', 'valid'], how='left')\
.with_columns(pl.col('valid').dt.round('6h').alias('valid'))\
.drop('lat_right', 'lon_right', 'elevation_right')\
.group_by(['station', 'valid'])\
.mean()\
.with_columns(pl.col(pl.Float64).cast(pl.Float32))potential_features = asos_ldf.drop('valid', 'station', 'lat', 'lon', 'elevation').collect_schema().names()
feature_list = []
for feature in potential_features:
if not asos_df.select(pl.col(feature).is_null().all()).item():
feature_list.append(feature)
stations_list = asos_df\
.select(pl.col('station'))\
.unique()\
.to_series()\
.to_list()from tqdm import tqdm
def safe_index(item, lst):
return item in lst
year_series = np.arange(start_date.year, end_date.year + 1, 1)
reduced_feature_df = pl.DataFrame(schema={**{'start_year': pl.Int64, 'end_year': pl.Int64, 'year_range': pl.Int64, 'year_label': pl.String, 'feature': pl.String},
**{station: pl.Boolean for station in stations_list}
})
for index, a in enumerate(tqdm(year_series)):
for b in year_series[index:]:
shifted_date_series = np.arange(datetime.date(a, 1, 1), datetime.date(b, 12, 31), datetime.timedelta(hours=6))
year_filter_df = asos_df.filter(pl.col('valid').dt.year().is_between(a, b))
for feature in feature_list:
if (year_filter_df.select(pl.col(feature).null_count()) != year_filter_df.height).item():
asos_pivot_df = year_filter_df\
.pivot(on='station', index='valid', values=feature, aggregate_function='mean')\
.drop('valid')
valid_stations = [s.name for s in asos_pivot_df if not (s.null_count() > len(shifted_date_series)*0.1)]
if len(valid_stations) >= 6:
if asos_pivot_df.var().select(pl.mean_horizontal(pl.all()).alias('mean')).item() > 5:
valid_stations.sort()
new_row = pl.DataFrame({**{'start_year': a,
'end_year': b,
'year_range': (b+1) - a,
'year_label': f'{a}-{b}',
'feature': feature},
**{station: safe_index(station, valid_stations) for station in stations_list}
})
reduced_feature_df = reduced_feature_df.vstack(new_row)
0%| | 0/11 [00:00<?, ?it/s]
9%|9 | 1/11 [00:01<00:11, 1.11s/it]
18%|#8 | 2/11 [00:02<00:08, 1.00it/s]
27%|##7 | 3/11 [00:02<00:07, 1.11it/s]
36%|###6 | 4/11 [00:03<00:05, 1.25it/s]
45%|####5 | 5/11 [00:03<00:04, 1.44it/s]
55%|#####4 | 6/11 [00:04<00:02, 1.67it/s]
64%|######3 | 7/11 [00:04<00:02, 1.96it/s]
73%|#######2 | 8/11 [00:04<00:01, 2.37it/s]
82%|########1 | 9/11 [00:05<00:00, 2.90it/s]
91%|######### | 10/11 [00:05<00:00, 3.71it/s]
100%|##########| 11/11 [00:05<00:00, 2.09it/s]
import matplotlib.pyplot as plt
import seaborn as sns
features = reduced_feature_df.select(pl.col('feature')).unique().to_series().to_list()
n_cols = 3
n_rows = -(-len(features) // n_cols) # Ceiling division
fig, axes = plt.subplots(n_rows, n_cols, figsize=(n_cols * 4, n_rows * 3), constrained_layout=True,
sharex=True, sharey=True)
axes = axes.flatten()
for idx, feature in enumerate(features):
plot_df = (
reduced_feature_df
.filter(pl.col('feature') == feature)
.drop('feature', 'start_year', 'end_year', 'year_range')
.to_pandas()
)
plot_df.index = plot_df['year_label'].astype(str) # Ensure index is str, not int
plot_df = plot_df.drop('year_label', axis=1)
plot_df = plot_df.sort_index()
ax = axes[idx]
sns.heatmap(plot_df, cmap='magma', annot=True, cbar=False, ax=ax)
ax.set_title(feature)
# Shared axis labels
fig.suptitle("Station Participation by Feature and Year", fontsize=16)
fig.supxlabel("Stations")
fig.supylabel("Year")
plt.show()df_t = reduced_feature_df\
.filter(pl.col('year_label').eq('2018-2020'))\
.drop('year_range', 'year_label', 'start_year', 'end_year')\
.transpose(include_header=True)
new_headers = df_t.row(0)
# Drop the first row
df_t_no_header = df_t.slice(1, df_t.height - 1)
# Assign new headers
df_t_renamed = df_t_no_header.rename({old: str(new) for old, new in zip(df_t_no_header.columns, new_headers)})valid_features = [col for col in df_t_renamed.columns if col != "feature"]
valid_stations = df_t_renamed\
.with_columns(pl.col(valid_features)\
.map_elements(lambda x: x == 'true', return_dtype=pl.Boolean))\
.filter( pl.all_horizontal([pl.col(col) for col in valid_features]))\
.select(pl.col('feature'))\
.to_series()\
.to_list()reduced_asos_df = asos_df\
.filter(pl.col('valid').is_between(datetime.datetime(2018, 1, 1, 0, 0), datetime.datetime(2021, 1, 1, 0, 0)))\
.filter(pl.col('station').is_in(valid_stations))\
.select(pl.col(['station', 'valid', 'lat', 'lon', 'elevation'] + valid_features))\
.sort(['valid', 'station'])\
.with_columns([(pl.col('drct').map_elements(lambda x: np.sin(np.radians(x)), return_dtype=pl.Float64)).alias('drct_sin'),
(pl.col('drct').map_elements(lambda x: np.cos(np.radians(x)), return_dtype=pl.Float64)).alias('drct_cos')])\
.drop('drct')
row_count, feature_count = reduced_asos_df.drop('station', 'valid', 'lat', 'lon', 'elevation').shape
valid_station = reduced_asos_df.select(pl.col('station')).head(7).to_series().to_list()
station_count = len(valid_station)
valid_features = reduced_asos_df.drop('station', 'valid', 'lat', 'lon', 'elevation').columns
# shape is time, station, feature
station_matrix = reduced_asos_df.drop('station', 'valid', 'lat', 'lon', 'elevation').to_numpy().reshape(int(row_count/station_count), station_count, feature_count)reduced_asos_df| station | valid | lat | lon | elevation | tmpf | dwpf | relh | sknt | feel | drct_sin | drct_cos |
|---|---|---|---|---|---|---|---|---|---|---|---|
| cat | datetime[μs] | f32 | f32 | f32 | f32 | f32 | f32 | f32 | f32 | f64 | f64 |
| "GCK" | 2018-01-01 00:00:00 | 37.927502 | -100.724403 | 881.0 | 9.283334 | -6.5 | 48.148335 | 8.666667 | -4.161667 | 0.851117 | 0.524977 |
| "LBL" | 2018-01-01 00:00:00 | 37.044201 | -100.9599 | 879.0 | 12.316667 | -1.983333 | 52.014999 | 10.166667 | -1.721667 | 0.664796 | 0.747025 |
| "EHA" | 2018-01-01 00:00:00 | 37.000801 | -101.879997 | 1099.0 | 15.555555 | 5.255556 | 63.382778 | 7.388889 | 4.482222 | 0.970763 | 0.24004 |
| "HQG" | 2018-01-01 00:00:00 | 37.163101 | -101.370499 | 956.52002 | 14.311111 | -1.605556 | 48.468334 | 7.777778 | 2.681667 | 0.936332 | 0.351115 |
| "3K3" | 2018-01-01 00:00:00 | 37.991699 | -101.7463 | 1005.700012 | 13.1 | -0.9 | 53.127777 | 6.777778 | 2.151111 | 0.981255 | 0.192712 |
| … | … | … | … | … | … | … | … | … | … | … | … |
| "EHA" | 2020-12-31 00:00:00 | 37.000801 | -101.879997 | 1099.0 | 42.355556 | 16.588888 | 34.955555 | 3.0 | 40.254444 | -0.533205 | -0.845986 |
| "HQG" | 2020-12-31 00:00:00 | 37.163101 | -101.370499 | 956.52002 | 40.722221 | 13.255555 | 32.23111 | 1.666667 | 39.450001 | 0.977334 | -0.211704 |
| "3K3" | 2020-12-31 00:00:00 | 37.991699 | -101.7463 | 1005.700012 | 40.200001 | 12.2 | 31.377777 | 4.555555 | 36.683334 | -0.700217 | -0.71393 |
| "JHN" | 2020-12-31 00:00:00 | 37.578201 | -101.7304 | 1012.710022 | 40.711113 | 17.4 | 38.554443 | 4.777778 | 37.06889 | -0.824675 | -0.565607 |
| "19S" | 2020-12-31 00:00:00 | 37.496899 | -100.832901 | 892.570007 | 39.922222 | 15.388889 | 36.552223 | 6.111111 | 35.113335 | -0.824675 | 0.565607 |
# as this is true it means there are no time slices where all values are nan
not np.any(np.all(np.isnan(station_matrix), axis=(1, 2)))True
import geopy.distance
# calculate the geodesic distance between two nodes
def compute_node_distance(node1, node2, inverse=False):
coords_1 = (node1[1], node1[0])
coords_2 = (node2[1], node2[0])
# elevation_difference = abs(node1[2] - node2[2])
horizontal_distance = geopy.distance.geodesic(coords_1, coords_2).km
# return math.sqrt(horizontal_distance**2 + (elevation_difference/1000)**2)
if inverse:
try:
horizontal_distance = 1/horizontal_distance
except ZeroDivisionError:
horizontal_distance = 0
return horizontal_distancestation_df = reduced_asos_df.select(pl.col(['station', 'lon', 'lat', 'elevation'])).unique().to_pandas()
grid_list = station_df.loc[:, ['lon', 'lat']].reset_index()[['lon', 'lat', 'index']].to_numpy().tolist()
grid_list = [sublist[:-1] + [int(sublist[-1])] for sublist in grid_list]
result_dict = {'index': [],
'station': []}
result_dict = {**result_dict, **{str(i): [] for _, _, i in grid_list}}
# find the nearest mesh node for every station node
for row_index, station in station_df.iterrows():
for col_index, station2 in station_df.iterrows():
# To get elevation with the new system I would have to find a heightmap for the new lon/lat coordinates
result_dict[str(col_index)].append(compute_node_distance([station['lon'], station['lat']], [station2['lon'], station2['lat']], inverse=True))
result_dict['station'].append(station['station'])
result_dict['index'].append(row_index)
# # generate a dataframe that maps every station node to its relative mesh node with the inverse distance
grid_map_df = pl.DataFrame(result_dict, schema={**{'station': pl.Categorical, 'index': pl.UInt64}, **{str(i): pl.Float64 for _, _, i in grid_list}})from sklearn.preprocessing import MinMaxScaler, minmax_scale
scaled_idistance = minmax_scale(grid_map_df.drop('station', 'index'))
modified_adjacency_df = grid_map_df.select(pl.col(['station', 'index']))\
.join(pl.DataFrame(scaled_idistance, schema=[str(i) for i, _ in enumerate(scaled_idistance)]).with_row_index(),
on='index')from shapely.geometry import Polygon, Point, LineString
import geopandas
import contextily as ctx
# Create Point geometries for each station (lon, lat order)
station_node_list = [Point(lon, lat) for lat, lon in station_df[['lat', 'lon']].to_numpy()]
stations_gdf = geopandas.GeoDataFrame(station_df.copy(), geometry=station_node_list, crs="EPSG:4326")
# Create edges as LineStrings between all unique pairs of stations
edge_list = []
n = len(station_node_list)
for i in range(n):
for j in range(i + 1, n):
edge_list.append(LineString([station_node_list[i], station_node_list[j]]))
edges_gdf = geopandas.GeoDataFrame(geometry=edge_list, crs="EPSG:4326")
# Plot everything
fig, ax = plt.subplots(figsize=(18, 10))
# Plot edges
edges_gdf.plot(ax=ax, color="xkcd:blue", linewidth=1, alpha=1)
# Plot nodes
stations_gdf.plot(ax=ax, color="xkcd:bright orange", markersize=10)
# Add station labels
for i, row in stations_gdf.iterrows():
ax.text(row.geometry.x, row.geometry.y, row['station'], fontsize=9)
# Add basemap
ctx.add_basemap(ax, crs=stations_gdf.crs.to_string())adj = modified_adjacency_df.drop('station', 'index').to_numpy()
adj = adj / adj.sum(axis=1, keepdims=True) # row normalize
def spatial_impute(data, adj):
imputed = data.copy()
T, N, F = data.shape
for t in range(T): # time index
for i in range(N): # station index
for f in range(F): # feature index
if np.isnan(imputed[t, i, f]):
# Get neighbor values and weights
neighbor_vals = imputed[t, :, f]
weights = adj[i]
mask = ~np.isnan(neighbor_vals)
if mask.sum() > 0:
imputed[t, i, f] = np.dot(weights[mask], neighbor_vals[mask]) / weights[mask].sum()
return imputed
def spatiotemporal_impute(data, adj):
data = spatial_impute(data, adj)
# fill remaining NaNs using temporal interpolation
T, N, F = data.shape
for i in tqdm(range(N)):
for f in range(F):
series = data[:, i, f]
mask = ~np.isnan(series)
if mask.sum() == 0:
continue # entire feature missing, can't interpolate
# Linear interpolation
indices = np.arange(T)
data[:, i, f] = np.interp(indices, indices[mask], series[mask])
return data
data_imputed = spatiotemporal_impute(station_matrix, adj)
0%| | 0/7 [00:00<?, ?it/s]
100%|##########| 7/7 [00:00<00:00, 4657.38it/s]
import pandas as pd
pd.DataFrame(data_imputed[:200, :, 2]).plot(legend=False, subplots=True, figsize=(20, 8))array([<Axes: >, <Axes: >, <Axes: >, <Axes: >, <Axes: >, <Axes: >,
<Axes: >], dtype=object)
pd.DataFrame(station_matrix[:200, :, 2]).plot(legend=False, subplots=True, figsize=(20, 8))array([<Axes: >, <Axes: >, <Axes: >, <Axes: >, <Axes: >, <Axes: >,
<Axes: >], dtype=object)
def node_correlation(data, node_number):
T, N, F = data.shape
# Store correlation matrices per node (features x features)
corr_within_node = np.empty((N, F, F))
for node in range(N):
# Slice data for node: shape (T, F)
node_data = data[:, node, :]
# Compute correlation matrix (features x features)
corr_within_node[node] = np.corrcoef(node_data, rowvar=False)
# Example: correlation between feature 0 and feature 1 at node 0
corr_list = []
for f_feature in range(F):
inner_list = []
for s_feature in range(F):
if f_feature < s_feature:
inner_list.append(np.nan)
else:
inner_list.append(float(corr_within_node[node_number, f_feature, s_feature]))
corr_list.append(inner_list)
return corr_list
def between_node_correlation(data, feature_number):
feature_data = data[:, :, feature_number]
corr_between_nodes = np.corrcoef(feature_data.T)
corr_between_nodes[np.triu_indices(corr_between_nodes.shape[0], 1)] = np.nan
return corr_between_nodesfig, axes = plt.subplots(2, 4, figsize=(20, 10))
axes = axes.flatten()
for i in range(data_imputed.shape[1]):
corr_matrix = pd.DataFrame(
node_correlation(data_imputed, i),
columns=valid_features,
index=valid_features
)
sns.heatmap(corr_matrix, annot=True, cmap='coolwarm', vmin=-1, vmax=1, ax=axes[i])
axes[i].set_title(f'{valid_station[i]} Correlation')
plt.tight_layout()
plt.show()fig, axes = plt.subplots(2, 4, figsize=(20, 10)) # 2 rows, 4 columns grid
axes = axes.flatten()
for i in range(data_imputed.shape[2]):
corr_matrix = pd.DataFrame(
between_node_correlation(data_imputed, i),
columns=valid_station,
index=valid_station
)
sns.heatmap(corr_matrix, annot=True, cmap='coolwarm', vmin=-1, vmax=1, ax=axes[i])
axes[i].set_title(f'Between-Node Correlation: {valid_features[i]}')
plt.tight_layout()
plt.show()from sklearn.preprocessing import robust_scale
T, N, F = data_imputed.shape
imputed_df = pl.from_numpy(data_imputed.reshape(T * N, F), schema=valid_features)\
.drop('dwpf', 'feel')\
.with_columns(pl.DataFrame([valid_station[i % len(valid_station)] for i in range(T*N)], schema={'station': pl.Categorical})\
.join(pl.from_pandas(station_df),
on='station',
how='left'))\
.select(pl.col(['station', 'lon', 'lat', 'elevation', 'tmpf', 'relh', 'sknt', 'drct_sin', 'drct_cos']))\
.drop('lon', 'lat', 'elevation')\
.with_columns(pl.col('relh')/100)
imputed_df = imputed_df.drop('tmpf', 'sknt')\
.hstack(pl.DataFrame(robust_scale(imputed_df.select(pl.col('tmpf'))), schema=['tmpf']))\
.hstack(pl.DataFrame(robust_scale(imputed_df.select(pl.col('sknt'))), schema=['sknt']))\
.select(pl.col(['station', 'tmpf', 'relh', 'sknt', 'drct_sin', 'drct_cos']))
imputed_df| station | tmpf | relh | sknt | drct_sin | drct_cos |
|---|---|---|---|---|---|
| cat | f64 | f64 | f64 | f64 | f64 |
| "GCK" | -1.355721 | 0.481483 | -0.080357 | 0.851117 | 0.524977 |
| "LBL" | -1.265174 | 0.52015 | 0.160714 | 0.664796 | 0.747025 |
| "EHA" | -1.168491 | 0.633828 | -0.285714 | 0.970763 | 0.24004 |
| "HQG" | -1.205638 | 0.484683 | -0.223214 | 0.936332 | 0.351115 |
| "3K3" | -1.241791 | 0.531278 | -0.383929 | 0.981255 | 0.192712 |
| … | … | … | … | … | … |
| "EHA" | -0.368491 | 0.349556 | -0.991072 | -0.533205 | -0.845986 |
| "HQG" | -0.417247 | 0.322311 | -1.205357 | 0.977334 | -0.211704 |
| "3K3" | -0.432836 | 0.313778 | -0.741072 | -0.700217 | -0.71393 |
| "JHN" | -0.417579 | 0.385544 | -0.705357 | -0.824675 | -0.565607 |
| "19S" | -0.441128 | 0.365522 | -0.491072 | -0.824675 | 0.565607 |
# if the station is going to be used for linear regression apply a one-hot encoding to it
imputed_pd_df = imputed_df.to_pandas()
imputed_pl_df = imputed_dfExplain your data preprocessing and cleaning steps.
Present your key findings in a clear and concise manner.
Use visuals to support your claims.
Tell a story about what the data reveals.
Summarize your key findings.
Discuss the implications of your results.